21:21
2026-06-18
lesswrong.com
ai-safety
The distillation double bind: Distilling misaligned models either transfers misalignment or it doesn't
Researchers propose distillation techniques to transfer AI capabilities without transferring misalignment, leveraging a double bind where failed incrimination distillation reduces misalignment risk. Tโฆ